Abstract


The current tendency of flight control system designs is
toward increased integration of applications and increased
distribution of computational elements. The reliability
analysis of such systems is difficult because subsystem
interactions are increasingly interdependent. Researchers
at NASA Langley Research Center have been working for several
years to extend the capability of Markov modelling techniques
to address these problems. This effort has been focused in the
areas of increased model abstraction and increased
computational capability. The reliability model generator
(RMG) is a software tool that uses as input a graphical
object-oriented block diagram of the system. RMG uses a
failure modes-effects algorithm to produce the reliability
model from the graphical description. The ASSURE software tool
is a parallel processing program that uses the semi-Markov
unreliability range evaluator (SURE) solution technique and
the abstract semi-Markov specification interface to the SURE
tool (ASSIST) modelling language. A failure modes-effects
simulation is used by ASSURE. These tools were used to analyze
a significant portion of a complex flight control system. The
successful combination of the power of graphical
representation, automated model generation, and parallel
computation leads to the conclusion that distributed
fault-tolerant system architectures can now be analyzed.


Introduction 

High reliability in digital systems is achieved, in a 
typical design, through redundancy and dynamic re- 
configuration. Markov model solution techniques are 
commonly used when computing the reliability of this
type of system. The state transition matrix represen- 
tation of a Markov model is useful for expressing the 
sequence dependencies that can occur during a series 
of system failures and subsequent recoveries. How- 
ever, distributed, fault-tolerant, and real-time sys- 
tems result in extremely large and complex models. 
One conclusion of the integrated airframe/propulsion 
control system architecture (IAPSA) program (ref. 1) 
is that two factors limit the use of Markov models on 
the systems being proposed for the next generation 
of aerospace vehicles. 

The first factor limiting the use of Markov models 
is that the state space grows exponentially with sys- 
tem size. This growth confines the size of the system 
that can be analyzed to one that can be accommo- 
dated by the available computing resources. One ex- 
ample is the Hybrid Automated Reliability Predictor 
(HARP) (ref. 2). The HARP program presents the 
user with a high-level interface consisting primarily 
of fault tree input (to describe system failure states) 
and fault/error-handling models (to describe recov- 
ery processes). This input is then translated into a 
Markov model and solved. To limit the size of the 
reliability model, HARP uses a process of behavioral
decomposition, aggregation, and truncation at the 
third level. An estimate of the resulting model size 
for a system with n components is given by 

$$\text{Total number of states} = \binom{n}{0} + \binom{n}{1} + \binom{n}{2} + \binom{n}{3} \qquad (1)$$

Now, consider the IAPSA architecture, which 
consists of over 500 components. The approxima- 
tion in equation (1) yields 21 million states. This 
approximation does not consider that, as in IAPSA, 
component dependencies limit the extent to which 
states can be aggregated. As discussed in a sub- 
sequent section, an IAPSA submodel with 80 com- 
ponents produced 27 million states. The magnitude 
of this problem is enormous. 
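
The size estimate can be checked numerically. The following is a minimal sketch, assuming that equation (1) counts every combination of zero to three failed components among n (truncation at the third level); the function names are hypothetical and are not part of HARP.

```c
#include <stdio.h>

/* Hypothetical helper: number of states when every combination of
 * 0..level failed components out of n is kept as a distinct state.
 * This reading of equation (1) is an assumption. */
unsigned long long truncated_state_count(unsigned n, unsigned level)
{
    unsigned long long term = 1;   /* C(n, 0) */
    unsigned long long total = 1;
    unsigned k;

    for (k = 1; k <= level && k <= n; k++) {
        /* C(n, k) = C(n, k-1) * (n - k + 1) / k, exact in integers */
        term = term * (n - k + 1) / k;
        total += term;
    }
    return total;
}

/* Print the third-level estimate for a given component count. */
void print_state_estimate(unsigned n)
{
    printf("n = %u: %llu states\n", n, truncated_state_count(n, 3));
}
```

Under this assumption, a call such as print_state_estimate(500) reports roughly 20.8 million states, which agrees with the figure of 21 million quoted above.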

The second factor limiting the use of Markov 
models is the difficulty in constructing a model of 
a large distributed and integrated system. The com- 
plex interdependencies confound the analyst’s under- 
standing of system behavior. Again, with IAPSA as 
an example, a single failure of a processing channel 
has the potential to affect three redundancy management
regimes: the processor, the I/O network, and
the I/O devices. These relationships, which can be 
significant, are at times obscured and threaten the 
accuracy of the model. 

Researchers at NASA Langley Research Center 
have been working for several years to extend the
capability of Markov solution techniques to sys- 
tems like IAPSA. These efforts have their founda- 
tion in the semi-Markov unreliability range eval- 
uator (SURE) (refs. 3 and 4) and the abstract 
semi-Markov specification interface to the SURE tool 
(ASSIST) (refs. 5 and 6). The more recent efforts 
that are the subject of this paper include the reli- 
ability model generator (RMG) (refs. 7 and 8) and 
ASSURE. RMG is based on an algorithm for au- 
tomating the failure modes-effects analysis (FMEA) 
that is part of every reliability analysis. RMG uses a 
graphically based object-oriented description of the 
system as input to this algorithm. The output 
of RMG is an ASSIST language description of the 
reliability model. ASSURE combines the ASSIST 
language with the SURE computational technique in 
a parallel program. ASSURE does not need to retain 
state information and therefore does not suffer from 
the state-space storage problem. ASSURE has also 
extended the ASSIST syntax to allow reference to 
a failure modes-effects simulation (FMES). Features 
such as graphical representation, automated model
generation, parallel processing, and FMES are being
combined into a tool set that will presumably
have the power to compute the reliability of large
fault-tolerant flight control systems.

In the following sections, three submodels of the 
IAPSA architecture are introduced as a basis for 
discussing RMG, ASSURE, and FMES. The next 
section is a brief description of IAPSA. 

IAPSA Architecture 

The integrated airframe/propulsion control sys- 
tem architecture (IAPSA) (ref. 1) was designed 
to meet the requirements generated when airframe 
and engine control laws are combined in a high-performance
military aircraft. Features of the aircraft
are canards and dual engines with variable
inlets and vectoring nozzles. 

Figure 1 is a representative block diagram of the 
IAPSA architecture. The architecture is based on 
the advanced information processing system (AIPS) 
building block elements (ref. 9). The AIPS building 
blocks have been designed to provide fundamental 
system resources for a wide spectrum of aerospace ap- 
plications. The building blocks include fault-tolerant 
processors (FTP's), network interfaces (NI’s), nodes, 
links, and device interface units (DIU’s). The FTP's 
can be configured as quad or triplex redundant com- 
puters. Nodes and links are used to construct re- 
pairable mesh networks. In operation a mesh net- 
work is configured as a bus; that is, the links on each 
node are statically enabled or disabled such that every
node can be reached. The I/O devices are connected
to the network through the DIU’s. If a failure
occurs on the network, the path with the failure is 
disabled and an alternate path is enabled. If this 
repair can be accomplished quickly, one mesh net- 
work can service the entire vehicle. In practice, using 
two networks is necessary, one to control the aircraft 
while the other is repaired. 



Figure 1. IAPSA architecture. 


A quad FTP has the major responsibility for air- 
frame control. Connected to it are two I/O mesh 
networks, one of which must be functioning for safe 
operation of the aircraft. A triplex FTP is used for 
each engine where again dual mesh networks handle 
the I/O traffic. A triplex mesh network provides a 
highly reliable data path for interprocessor commu- 
nication. In total, the architecture consists of 10 pro- 
cessor channels, 20 NI’s, 50 nodes, 90 links, 36 DIU’s, 
and 300 I/O devices for a total of 490 components. 
This design does not include the components necessary
to establish the interprocessor link. This part of
the system has not yet been designed, but analysts
estimate that about 100 NI’s, nodes, and links would
be used to implement it.

Reliability Model Generator 

The reliability model generator was designed as 
a tool for system designers. Working from a data 
base of building blocks, designers can construct a 
graphical block diagram of the system. When the 
design is finished, an automatic failure modes-effects 
analysis is performed with data associated with the 
graphical building block objects. The result of the 
FMEA is then translated into a reliability model in 
the ASSIST language (refs. 5 and 6). 

The automated FMEA is implemented with an 
object-oriented data base approach conceived by The 
Boeing Company (refs. 7 and 8). In the data base, a 
building block has graphical attributes of the building
block itself as well as its inputs and outputs.
The building block has data attributes of component 
modes and mode transition functions. The input and 
output have data attributes of effect messages and 
output transition functions. 

Examples of component modes are GOOD, 
FAILED ACTIVE, and FAILED PASSIVE. Com- 
ponent modes are closely related to reliability model 
state variables. Mode transition functions control 
mode state changes. The mode transition functions 
are similar to the transition rules found in ASSIST. 
A mode transition function can have as its input 
the current mode, the value of the building block 
input and output, and a rate. Thus, a mode transition
function can specify that if a building block
is GOOD, then it may become FAILED ACTIVE at
rate λ.

Building block output effect messages take on val- 
ues such as NOMINAL, ERROR, and NONE. Out- 
put transition functions control the value of the mes- 
sages. Output transition functions have as their 
input the building block input and current compo- 
nent mode. Output transition functions are consid- 
ered to be an instantaneous evaluation of building 
block behavior. These functions are loosely related to 
the death conditions found in ASSIST. For example, 
an output transition function can specify that an out- 
put effect is ERROR if either the component mode 
is FAILED ACTIVE or an input effect is ERROR. 
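
The example rule above can be sketched in C. This is an illustrative sketch only: the type and function names are hypothetical, and the handling of FAILED PASSIVE (emit NONE) and of a GOOD block (pass the input through) are assumptions made to complete the example, not rules stated for RMG.

```c
/* Illustrative names; these are not RMG's actual data structures. */
typedef enum { MODE_GOOD, MODE_FAILED_ACTIVE, MODE_FAILED_PASSIVE } Mode;
typedef enum { MSG_NOMINAL, MSG_ERROR, MSG_NONE } Effect;

/* Output transition function: an instantaneous evaluation of one block. */
Effect output_transition(Mode mode, Effect input)
{
    /* Rule from the text: ERROR if the block has failed actively
     * or if an input effect is ERROR. */
    if (mode == MODE_FAILED_ACTIVE || input == MSG_ERROR)
        return MSG_ERROR;
    /* Assumption: a passively failed block produces no output. */
    if (mode == MODE_FAILED_PASSIVE)
        return MSG_NONE;
    /* Assumption: a GOOD block passes its input effect through. */
    return input;
}

/* Two blocks in series: the effect message of the first feeds the second. */
Effect chain_output(Mode first, Mode second, Effect input)
{
    return output_transition(second, output_transition(first, input));
}
```

Chaining the functions in this way mirrors how an ERROR effect raised in one building block is carried to every block downstream of it.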

To perform the automated FMEA, a building 
block representing the system is formed with an 
output that reflects the system’s condition and with 
inputs from other building blocks. RMG is then 
directed to analyze the system for conditions leading 
to an ERROR output of the system. A backward- 
chaining technique is used to trace this failed state 
throughout the system. As the state is traced, this 
technique constructs the core of the reliability model. 

Example 1: FTP Network Interface 

Figure 2 is a diagram of the first example, which 
focuses on the interaction of the FTP with the mesh 
networks. Here the mesh networks are modelled as 
single, repairable components. Three FTP channels 
are connected to each network so that both networks 
function in the event of two channel failures. Initially,
FTP channel 1 (CH1) is controlling network 1
(NET1) with network interface 1 (NI1), and CH4 is
controlling NET2 with NI6. The remaining connections
are disabled. While appearing simple on the
surface, this model is rich in interdependencies.



Figure 2. Example 1: FTP network interface. 


Because CH1 initially controls NET1 and CH4
controls NET2, four network interfaces (NI2, NI3,
NI4, and NI5) are not used at this time. An FTP
channel failure causes the failure of its NI unit(s).
Thus, a failure of CH1 causes NI1 to fail. The effect
of this failure depends on whether or not that
particular NI unit was controlling its associated network
at the time of the failure. For example, if CH1
fails from initial conditions (i.e., NI1 is controlling
NET1), then two recovery mechanisms must be activated:
one to repair the FTP by disabling CH1 and
the other to repair NET1 by enabling NI2 as controller
of NET1. If NI1 fails from the initial state,
then a network recovery disables the failed link and
enables NI2 as the NET1 controller on the condition
that CH2 and NI2 have not yet failed. A subsequent
failure of CH1 results only in a recovery of the FTP
because CH1 is not the current network controller.
A reliability analysis tool must be able to track such
dependencies without burdening the user with cumbersome
constructs or cryptic tricks. (See ref. 10 for
further discussion.)

Figure 3 is the RMG block diagram for exam- 
ple 1. This model contains all the elements of figure 2 
with the addition of building blocks representing the 
redundancy-management (RM) routines (FTP RM 
and NETn RM) and the SYSTEM building block. 
The following description lists the component at- 
tributes and explains how RMG uses these attributes 
to perform the automated FMEA. 
Figure 3. RMG diagram for example 1.


The FTP channels (CH1-4) have the following attributes:

Component modes:              (GOOD, FAILED, REMOVED);
Inputs:
Outputs:                      CH_STATUS;
Mode transition functions:    IF (mode=GOOD) THEN (mode=FAILED) AT failure_rate;
                              IF (mode=FAILED) THEN (mode=REMOVED) AT recover_rate;
Output modes:                 (NOMINAL, ERROR, NONE);
Output transition functions:  IF (mode=GOOD) THEN (CH_STATUS=NOMINAL)
                              ELSE IF (mode=FAILED) THEN (CH_STATUS=ERROR)
                              ELSE IF (mode=REMOVED) THEN (CH_STATUS=NONE);


The FTP RM has the following attributes:

Component modes:              ();
Inputs:                       CH_STATUS_1, CH_STATUS_2, CH_STATUS_3, CH_STATUS_4;
Outputs:                      FTP_STATUS;
Mode transition functions:
Output modes:                 (NOMINAL, ERROR);
Output transition functions:  IF number_of((CH_STATUS_1=NOMINAL),
                                           (CH_STATUS_2=NOMINAL),
                                           (CH_STATUS_3=NOMINAL),
                                           (CH_STATUS_4=NOMINAL)) >
                                 number_of((CH_STATUS_1=ERROR),
                                           (CH_STATUS_2=ERROR),
                                           (CH_STATUS_3=ERROR),
                                           (CH_STATUS_4=ERROR)) THEN
                                 FTP_STATUS=NOMINAL
                              ELSE
                                 FTP_STATUS=ERROR;

The FTP RM block is an instantaneous evaluation of the state of the FTP and thus does not require
component modes or mode transition functions. The RMG provides a convenient number_of function that
accumulates the number of TRUE conditions found in the argument list. Here, the output transition function
uses the number_of function to perform a simple majority vote evaluation. In the case of a quad vote, two or
more inputs receiving ERROR status cause the FTP RM block to transmit an ERROR status. The effect of
a recovery of a failed channel is to send a NONE status, which protects the FTP from failure on a subsequent
channel failure.
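
The majority vote performed by the FTP RM output transition function can be sketched in C as follows. The C names are hypothetical stand-ins; in particular, this number_of only imitates the RMG function of the same name by counting array entries equal to a given status.

```c
#include <stddef.h>

typedef enum { ST_NOMINAL, ST_ERROR, ST_NONE } Status;

/* Stand-in for the RMG number_of function: counts the entries of s
 * whose value equals wanted. */
size_t number_of(const Status *s, size_t n, Status wanted)
{
    size_t i, count = 0;

    for (i = 0; i < n; i++)
        if (s[i] == wanted)
            count++;
    return count;
}

/* FTP RM output transition: NOMINAL only if the NOMINAL channels
 * strictly outnumber the ERROR channels; NONE inputs do not vote. */
Status ftp_rm_status(const Status ch_status[], size_t n)
{
    size_t nominal = number_of(ch_status, n, ST_NOMINAL);
    size_t error = number_of(ch_status, n, ST_ERROR);

    return (nominal > error) ? ST_NOMINAL : ST_ERROR;
}
```

With one channel already recovered (status NONE), a second channel failure leaves two NOMINAL inputs against one ERROR input, so the sketch still returns NOMINAL; two simultaneous ERROR inputs in a full quad return ERROR.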


The NI components have the following attributes:

Component modes:              (GOOD, FAILED);
Inputs:                       (CH_STATUS);
Outputs:                      (NI_STATUS);
Mode transition functions:    IF (mode=GOOD) THEN (mode=FAILED) AT failure_rate;
Output modes:                 (NOMINAL, ERROR);
Output transition functions:  IF (mode=GOOD) and (CH_STATUS=NOMINAL) THEN
                                 NI_STATUS=NOMINAL
                              ELSE
                                 NI_STATUS=ERROR;


The formulation of the output transition function causes the NI component to produce an error message 
output when the host channel fails. Thus, to those components connected to the NI outputs, the NI itself 
appears to have failed. 


The NET RM components have the following attributes:

Component modes:              (MODE1, MODE2, MODE3);
Inputs:                       NI_STATUS_1, NI_STATUS_2, NI_STATUS_3;
Outputs:                      NET_RM_STATUS;
Mode transition functions:    IF (mode=MODE1) and (NI_STATUS_1=ERROR) THEN
                                 IF (NI_STATUS_2=NOMINAL) THEN
                                    (mode=MODE2) AT recovery_rate;
                                 ELSE IF (NI_STATUS_3=NOMINAL) THEN
                                    (mode=MODE3) AT recovery_rate;
                              IF (mode=MODE2) and (NI_STATUS_2=ERROR) THEN
                                 IF (NI_STATUS_3=NOMINAL) THEN
                                    (mode=MODE3) AT recovery_rate;
Output modes:                 (NOMINAL, ERROR);
Output transition functions:  IF (mode=MODE1) THEN
                                 (NET_RM_STATUS=NI_STATUS_1);
                              IF (mode=MODE2) THEN
                                 (NET_RM_STATUS=NI_STATUS_2);
                              IF (mode=MODE3) THEN
                                 (NET_RM_STATUS=NI_STATUS_3);

The NET RM block uses the status outputs of the three NI components to determine its operating mode.
The operating mode corresponds to which NI (and therefore which FTP channel) is controlling the network.
The status of the controlling NI is propagated as the NET RM output to the NET component.


The NET components have the following attributes:

Component modes:              (GOOD, FAILED);
Inputs:                       NET_RM_STATUS;
Outputs:                      NET_STATUS;
Mode transition functions:    IF (mode=GOOD) THEN (mode=FAILED) AT failure_rate;
                              IF (mode=FAILED) THEN (mode=GOOD) AT recovery_rate;
Output modes:                 (NOMINAL, ERROR);
Output transition functions:  IF (mode=GOOD) THEN (NET_STATUS=NET_RM_STATUS)
                              ELSE (NET_STATUS=ERROR);
The NET component is assumed to be infinitely repairable; that is, the NET has inexhaustible spares.
However, when the NET RM indicates an NI failure, the NET propagates an ERROR indication until the 
NET RM replaces the failed NI (if possible). 
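
The NET RM failover behavior referred to above can be sketched as a small state machine in C. The function and type names are hypothetical, and the sketch models only the choice of the next controlling NI once a recovery transition fires; the recovery rate itself is not represented.

```c
typedef enum { NI_NOMINAL, NI_ERROR } NiStatus;
typedef enum { MODE1 = 1, MODE2 = 2, MODE3 = 3 } NetRmMode;

/* Mode reached when a recovery transition fires; the mode is unchanged
 * if the controlling NI is healthy or no replacement NI is NOMINAL. */
NetRmMode net_rm_next_mode(NetRmMode mode, const NiStatus ni[3])
{
    if (mode == MODE1 && ni[0] == NI_ERROR) {
        if (ni[1] == NI_NOMINAL)
            return MODE2;
        if (ni[2] == NI_NOMINAL)
            return MODE3;
    } else if (mode == MODE2 && ni[1] == NI_ERROR) {
        if (ni[2] == NI_NOMINAL)
            return MODE3;
    }
    return mode;
}

/* NET RM output: the status of whichever NI currently controls the network. */
NiStatus net_rm_status(NetRmMode mode, const NiStatus ni[3])
{
    return ni[mode - 1];
}
```

Until the mode changes, net_rm_status keeps returning the ERROR of the failed controller, which corresponds to the NET propagating an ERROR indication until the failed NI is replaced.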

The SYSTEM building block has the following attributes:

Component modes:              ();
Inputs:                       NET_STATUS_1, NET_STATUS_2, FTP_STATUS;
Outputs:                      SYSTEM_STATUS;
Mode transition functions:
Output modes:                 (NOMINAL, ERROR);
Output transition functions:  IF ((NET_STATUS_1=NOMINAL) or
                                  (NET_STATUS_2=NOMINAL)) and
                                 (FTP_STATUS=NOMINAL) THEN
                                 (SYSTEM_STATUS=NOMINAL)
                              ELSE
                                 (SYSTEM_STATUS=ERROR);


The SYSTEM building block contains an output 
transition function that modifies the SYSTEM mode 
from NOMINAL to ERROR in the event that either 
both network output effect messages are ERROR or 
the FTP RM output message is ERROR. 

The SYSTEM building block is used as a starting
point for the FMEA. The conditions in the SYSTEM
output transition function that contribute to an ERROR
condition are traced back, assembled, and reduced to
disjunctive normal form (see footnote 1). These conditions
can then be listed as DEATHIF statements in the ASSIST
model description. Mode transition functions are resolved
and used as model expansion rules (called TRANTO
rules) in ASSIST.
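
As a small illustration of the reduction, the ERROR condition of the SYSTEM block can be written in disjunctive normal form and checked against the original output transition function. This is a sketch with hypothetical function names; it stops at the block inputs, whereas RMG continues the trace back to the component modes.

```c
#include <stdbool.h>

/* NOMINAL condition as written in the SYSTEM output transition function:
 * (NET1 NOMINAL or NET2 NOMINAL) and FTP NOMINAL. */
bool system_nominal(bool net1_err, bool net2_err, bool ftp_err)
{
    return (!net1_err || !net2_err) && !ftp_err;
}

/* ERROR condition in disjunctive normal form:
 * (NET1 ERROR and NET2 ERROR) or (FTP ERROR). */
bool system_error_dnf(bool net1_err, bool net2_err, bool ftp_err)
{
    return (net1_err && net2_err) || ftp_err;
}

/* Exhaustive check over all eight input combinations that the DNF
 * expression is exactly the negation of the NOMINAL condition. */
bool dnf_matches_original(void)
{
    int bits;

    for (bits = 0; bits < 8; bits++) {
        bool n1 = (bits & 1) != 0;
        bool n2 = (bits & 2) != 0;
        bool f = (bits & 4) != 0;

        if (system_error_dnf(n1, n2, f) == system_nominal(n1, n2, f))
            return false;
    }
    return true;
}
```

Each OR-connected term of the DNF expression would correspond to one DEATHIF condition in the generated ASSIST description.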

Results 

Models of example 1 were both manually coded
and automatically generated with RMG. The models
appeared to be very different. RMG produced
an exhaustive expansion of the system. The manually
coded model was more compact, a situation
analogous to comparing manually written assembler
code with compiler-generated code. The computed
reliability for the two models differed by a small
amount (fifth decimal digit). Comparing the models
for equivalence uncovered an interesting discrepancy.
The RMG-generated model achieved a more
thorough expansion of the state space. In the manually
coded model, some network failures were inadvertently
omitted. The RMG-generated model took
about five times as long to process because of both
the size of the model representation (the ASSIST
code) and the larger state space that RMG covered.
The automatically generated model is almost
three times larger than the manually coded model
(see table I).

1 Given a logical expression that consists of a series of
subexpressions that are connected by AND or OR, disjunctive
normal form is a reduction of the logical expression to one
that is a series of subexpressions connected by OR where the
subexpressions contain only AND logical functions.

Table I. Example 1 Performance Metrics

                              Model
Parameter           Manually coded    RMG generated
Number of states    1555              4466
ASSIST time         105 sec           1810 sec
SURE time           420 sec           633 sec


Discussion 

The particular diagram shown in figure 3 is not an 
ideal graphical representation. Having the display of 
subcomponents (such as the NI’s) somehow represent
the particular relationship between the subcompo- 
nents and their parent components is preferred. For 
example, the NI communicates with an FTP channel 
and is critically dependent on the FTP channel. The 
NI is a subcomponent and should be viewed as such. 
(See fig. 2.) 

Redundancy management routines are more difficult
to represent. A redundancy management routine
is what turns a discrete set of computer channels into
a fault-tolerant computer. Yet, the redundancy management
routine is not a component typically pictured
in a block diagram as it is in figure 3. As for
the FTP, the FTP RM might be better expressed by 
explicitly showing the interchannel linkages and vot- 
ers as subcomponents that are part of the actual FTP 
channel architecture. However, this type of expression
does not work for the NET RM. The NET RM
organizes the NI’s, which are part of an FTP channel,
and the mesh network into a fault-tolerant network. 
As modelled in this example, network recoveries are 
generated in both the NET RM and the NET com- 
ponent. This adaptation was unavoidable because 
of limitations in the version of RMG used to gen- 
erate this example. When considering a better way 
to represent the NET RM, it is difficult to imagine a 
clean construct that can be added to each component 
of this assemblage and be able to describe the NET 
RM function. The redundancy management routines 
are thus best described as separate objects whose at- 
tributes can be related to other components with a 
graphical device such as color or a unique icon. This 
concept will be considered in future versions of the 
software. 


ASSURE 

Given the capability to automatically generate 
a model, the problem immediately becomes one 
of computing the extremely large models that will 
certainly follow. The ASSIST/SURE combination 
has the drawback that the entire state space must 
be generated by ASSIST and searched by SURE. 
While methods of pruning the state space and path 
depth have been developed for both ASSIST and 
SURE, modest models of a few dozen interdependent 
components quickly tax current workstations. 

The ASSIST modelling language has been com- 
bined with the SURE solution technique in a relia- 
bility analysis tool (ASSURE) in which state-space 
storage requirements are minimized. The SURE so- 
lution technique provides for the calculation of a 
Markov model as the expansion of a series of indepen- 
dent paths (ref. 4). The ASSIST modelling language 
describes how these paths are grown (ref. 5). In 
ASSURE, the ASSIST language is translated into C, 
linked with SURE solution procedures, and executed 
to solve the model. The state probabilities can then 
be calculated as the model is grown. Two mecha- 
nisms are available to reduce model size. With ac- 
cess to the state probabilities, an informed decision 
can be made as to when to terminate path growth 
(e.g., when state probability < 10^-14). Also, because
the only state of consequence at any time is
the state being expanded, when expansion is com- 
plete, the state can be discarded. Thus, ASSURE 
does not need to maintain the complete state space 
in memory. Also, because the paths through the 
model are independent, the ASSURE program can 
be parallelized. 
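
The two size-reduction mechanisms can be illustrated with a toy depth-first path expansion in C. This is a sketch under stated assumptions, not the SURE bounding computation: the model is n identical components, the system is dead once a given number have failed, and each failure step multiplies the path probability by an assumed constant factor q. All names are hypothetical.

```c
typedef struct {
    double death_prob;     /* probability mass reaching a death state */
    double pruned_prob;    /* probability mass left on pruned paths */
    unsigned long deaths;  /* number of paths ending in a death state */
    unsigned long pruned;  /* number of paths cut off by the threshold */
} PathStats;

/* Expand every path leaving the current state. Only the current path
 * lives on the call stack; no table of visited states is kept. */
void expand_paths(unsigned good, unsigned failed, unsigned fatal,
                  double q, double path_prob, double threshold,
                  PathStats *st)
{
    unsigned i;

    if (failed >= fatal) {            /* death state: not expanded */
        st->death_prob += path_prob;
        st->deaths++;
        return;
    }
    if (path_prob < threshold) {      /* prune: stop growing this path */
        st->pruned_prob += path_prob;
        st->pruned++;
        return;
    }
    /* One outgoing transition per good component (toy assumption). */
    for (i = 0; i < good; i++)
        expand_paths(good - 1, failed + 1, fatal, q,
                     path_prob * q, threshold, st);
}
```

Because each recursive call depends only on its own arguments, the subtrees below the first few transitions could be handed to separate processors, which is the property that allows the path expansion to be parallelized.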


Example 2: Nodes, Links, and Devices 

Figure 4 illustrates a problem generated to test 
the capability of ASSURE. The system is an evo- 
lution of example 1 with the addition of a two-layer 
network and I/O devices. A mesh network could not 
be modelled initially because of the difficulty in ex- 
pressing the network regrow algorithm in the ASSIST 
language. (This difficulty was later rectified. See
section entitled “Failure Modes-Effects Simulation.”)
The I/O devices are quad redundant, use majority 
voting, and have redundancy management routines 
similar to those of FTP. 



[Figure 4 legend: FTP channel; network interface; link; node; devices A, B, C, and D.]


Figure 4. Example 2: FTP network interface with two-layer 
network and I/O devices. 

Results 

ASSIST produced a reliability model for the
system in figure 4; the model contained over
40 000 states and 1 000 000 transitions (with no pruning).
Direct comparison with ASSURE is not possible
because ASSURE does not aggregate states when
it produces the model. Model statistics (reliability
and pruning bounds, number of pruned paths,
and pruning error) for SURE and ASSURE were
identical; thus, ASSURE computed the model correctly.
ASSURE exists in both serial and parallel
form. The test runs for the serial version of ASSURE
and ASSIST/SURE were performed on a SUN 3/150
processor. The parallel version of ASSURE was executed
on a 32-node iPSC/860 hypercube. The serial
ASSURE program execution was 10 times faster and
used 100 times less memory than ASSIST/SURE.
Parallel ASSURE increased this performance another
100 times. (See table II for details.) Overall, a
speed increase on the order of 3 orders of magnitude
is realized over the original ASSIST/SURE solvers.
The processors in the hypercube are typically over
90 percent utilized.

Table II. Example 2 Performance Metrics

                     Processor      SURE model size    Run time, hr    Memory usage
ASSIST/SURE          SUN 3/150      27 Mbyte           11.50           100 Mbyte
ASSURE (Serial)      SUN 3/150      NA                 0.60            1 Mbyte
ASSURE (Parallel)    32 iPSC/860    NA                 0.01            1 Mbyte per node

Note, ASSURE is a prototype and thus does not 
perform extensive error checking (as does SURE).
If extensive error checking were performed it would 
reduce the observed improvement. However, given 
the degree of efficiency of the parallel version, a 
great deal of improvement will always be obtained 
with parallelization. Serial ASSURE benefits from 
not having to maintain the complete state space in 
memory while computing. As soon as the state space 
outgrows available physical memory, ASSIST/SURE 
suffers performance degradation due to swapping of 
virtual memory. 

Failure Modes-Effects Simulation 

As previously mentioned, expressing the mesh 
network regrow algorithm in the ASSIST language 
is difficult. Two possible methods are exhaustive 
enumeration (which is almost immediately ruled out) 
and the division of the algorithm into discrete steps. 
The division method is possible but presents a con- 
fusing model because each step in the process must 
be assigned a rate and therefore produces another 
state with subsequent children states. 

An alternative approach takes advantage of the 
ASSURE translation of ASSIST into C code. Thus, 
the regrow algorithm can be coded in a C procedure 
and ASSURE can reference this procedure at the 
appropriate time. Studying this approach revealed 
that an extension of the ASSIST syntax was necessary.
Further work using the extension to ASSIST
led to an approach in which the concept of automated 
FMEA fostered by RMG is incorporated as C code in 
ASSURE. This concept is the failure modes-effects 
simulation. 

ASSIST Extensions 

The basic components of an ASSIST model de- 
scription are the state vector, model expansion rules, 
and model termination rules. The reliability model 
is produced by repeatedly applying the model ex- 
pansion rules to a state vector and thus creating new 
state vectors. The process continues until the list 
of state vectors is exhausted. A model expansion 
rule (called a TRANTO statement) is composed of a 
conditional expression, a state translation expression, 
and a rate. A transition in a reliability model is thus 
completely defined by its starting state (identified by 
the conditional expression), its ending state (defined 
by the translation expression), and the rate at which 
the transition occurs. Model growth is terminated by 
checking the new state against the model termina- 
tion rules (conditional expressions called DEATHIF 
statements). Death states are not expanded. System 
unreliability is calculated as the total probability of 
entering a death state before the end of the mission 
time. 

The ASSIST language was extended to allow ref- 
erence to two types of C functions termed conditional 
functions and effect functions . A conditional func- 
tion takes as input the state vector and returns a 
value of TRUE or FALSE. Conditional functions are 
used in DEATHIF statements and the conditional 
part of TRANTO statements. An effect function is 
used in place of the state translation expression of a 
TRANTO statement. 

Figures 5(a) and 5(b) show models of a simple 
quad FTP in standard ASSIST (fig. 5(a)) and ex- 
tended ASSIST (fig. 5(b)). In standard ASSIST, 
the model begins with the declaration of two tran- 
sition rate constants. This declaration is followed 
by a SPACE statement that defines the state vector 
as two four-element arrays (CH_G and CH_B). These 
arrays are of type Boolean and indicate whether FTP 
channels are GOOD (CH_G) or BAD (CH_B). In the
START statement, the state vector is initialized to
all channels being GOOD. The DEATHIF statement
supplies a majority vote termination condition. Finally,
two TRANTO statements (IF ... TRANTO ...
BY ...;) supply model growth rules. The TRANTO
statements are embedded in a FOR loop to scan each 
element of the state vector arrays. 

CH_Fail_Rate = 1.0E-4;
FTP_Recovery = 3.0E4;

SPACE = (CH_G: array[1..4], CH_B: array[1..4]);
START = (4 of 1, 4 of 0);

DEATHIF CH_B[1]+CH_B[2]+CH_B[3]+CH_B[4] >=
        CH_G[1]+CH_G[2]+CH_G[3]+CH_G[4];

FOR i=1,4
   IF CH_G[i]=1 TRANTO
      CH_G[i]=0, CH_B[i]=1 BY CH_Fail_Rate;
   IF CH_B[i]=1 TRANTO
      CH_B[i]=0 BY FTP_Recovery;
ENDFOR;


(a) FTP model in standard ASSIST. 

CH_Fail_Rate = 1.0E-4;
FTP_Recovery = 3.0E4;

SPACE = (FTP, CH: array[1..4]);
START = (5, 4 of 5);

DEATHIF ERRFTP();

FOR i=1,4
   IF GOOD(CH[i]) TRANTO
      CH_FailEff(i) BY CH_Fail_Rate;
   IF RECOVER(FTP) TRANTO
      FTP_RecEff() BY FTP_Recovery;
ENDFOR;

(b) FTP model in extended ASSIST. 
Figure 5. Simple quad FTP models. 
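
The model expansion process described earlier (apply the TRANTO rules to each state vector until no unexpanded state remains, and do not expand death states) can be sketched in C for the standard ASSIST model of figure 5(a). This is an illustrative sketch and not the ASSIST implementation; the bit-packed state encoding and all function names are assumptions.

```c
#include <string.h>

#define NCH 4

/* Assumed encoding: CH_G[1..4] in bits 0..3, CH_B[1..4] in bits 4..7. */
static int count_bits(unsigned v)
{
    int c = 0;

    while (v) {
        c += (int)(v & 1u);
        v >>= 1;
    }
    return c;
}

/* DEATHIF: number of BAD channels >= number of GOOD channels. */
static int is_death(unsigned s)
{
    return count_bits((s >> 4) & 0x0Fu) >= count_bits(s & 0x0Fu);
}

/* Expands the model from START; returns the number of distinct states
 * and stores the number of death states in *death_states. */
int expand_quad_ftp(int *death_states)
{
    unsigned char seen[256];
    unsigned work[256];              /* list of states awaiting expansion */
    int top = 0, total = 0, deaths = 0, i, j;

    memset(seen, 0, sizeof seen);
    work[top++] = 0x0Fu;             /* START = (4 of 1, 4 of 0) */
    seen[0x0F] = 1;

    while (top > 0) {
        unsigned s = work[--top];

        total++;
        if (is_death(s)) {           /* death states are not expanded */
            deaths++;
            continue;
        }
        for (i = 0; i < NCH; i++) {
            unsigned next[2];
            int n = 0;

            if (s & (1u << i))       /* CH_G[i]=1: channel i fails */
                next[n++] = (s & ~(1u << i)) | (1u << (i + 4));
            if (s & (1u << (i + 4))) /* CH_B[i]=1: channel i is recovered */
                next[n++] = s & ~(1u << (i + 4));
            for (j = 0; j < n; j++) {
                if (!seen[next[j]]) {
                    seen[next[j]] = 1;
                    work[top++] = next[j];
                }
            }
        }
    }
    *death_states = deaths;
    return total;
}
```

For this sketch the expansion reaches 57 distinct state vectors, 30 of which are death states. These counts describe the sketch only, since ASSIST and SURE may aggregate or prune states differently.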


In the extended ASSIST model (fig. 5(b)), notice 
the conditional function calls ERRFTP(), GOOD(),
and RECOVER() and effect function calls
CH_FailEff() and FTP_RecEff(). The model in fig- 
ure 5(b) also reflects a different modelling strategy, 
which is a natural result of the FMES process. Con- 
sider the state vector. Two entities are modelled in 
this system: actual physical components called channels
(CH[i]) and a super component called FTP. The
FTP is a logical entity whose state is a collective 
function of the channels’ states. Also, these compo- 
nents no longer have simple Boolean values but can 
take on a range of values as follows: 

GOOD = 1;
ACTIVE = 2;
IN_USE = 4;
ERROR = 8;
RECOVERING = 16;
ELIMINATED = 32;

These values represent single bits in the state variable 
and can be combined to define a component’s state. 
Thus, a component can be GOOD (with state = 1) 
or a component can be GOOD and IN_USE (with
state = 5). The GOOD + IN_USE value is used to
initialize the state variables in the START statement 
in figure 5(b). Defining macros to operate on the 
state variables is often helpful. The following macros 
are used in the FMES code: 


SetRecovery(v):     Sets the RECOVERING bit.
SetFailError(v):    Sets the ERROR bit and clears the GOOD bit.
SetElim(v):         Sets the ELIMINATED bit.
SetNotInUse(v):     Clears the IN_USE bit.
GoodInUse(v):       Tests state variable for both GOOD and IN_USE bits.
ErrorInUse(v):      Tests state variable for both ERROR and IN_USE bits.
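
One possible C rendering of the state bits and macros is sketched below. The macro bodies are assumptions inferred from the descriptions above, not the FMES source, and fail_then_remove is a hypothetical helper added only to show the macros in sequence.

```c
/* State bits; values follow the list given in the text. */
enum {
    GOOD = 1,
    ACTIVE = 2,
    IN_USE = 4,
    ERROR = 8,
    RECOVERING = 16,
    ELIMINATED = 32
};

/* Assumed macro bodies; each argument must be a modifiable int variable
 * for the Set... macros. */
#define SetRecovery(v)   ((v) |= RECOVERING)
#define SetFailError(v)  ((v) = ((v) | ERROR) & ~GOOD)
#define SetElim(v)       ((v) |= ELIMINATED)
#define SetNotInUse(v)   ((v) &= ~IN_USE)
#define GoodInUse(v)     (((v) & (GOOD | IN_USE)) == (GOOD | IN_USE))
#define ErrorInUse(v)    (((v) & (ERROR | IN_USE)) == (ERROR | IN_USE))

/* Hypothetical helper: a component fails and is then taken out of use
 * and eliminated; returns the resulting state value. */
int fail_then_remove(int v)
{
    SetFailError(v);   /* GOOD cleared, ERROR set */
    SetNotInUse(v);    /* IN_USE cleared */
    SetElim(v);        /* ELIMINATED set */
    return v;
}
```

Starting from GOOD + IN_USE (state = 5), SetFailError yields ERROR + IN_USE (state = 12), so ErrorInUse becomes true and GoodInUse becomes false.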


A Simple FMES 

Figures 5(a) and 5(b) are practically identical
with the exception that functions written in stan- 
dard ASSIST have been replaced by function calls in 
extended ASSIST. Conditional functions can take as 
parameters one or more state variables. Effect func- 
tions can pass an integer argument for array index- 
ing. The primary benefit of using extended ASSIST 
is that complex state transitions such as a network 
repair can be coded in algorithmic form instead of the 
exhaustive enumeration sometimes necessary with 
standard ASSIST. A secondary benefit is that the 
resulting ASSIST model is less complicated and thus 
more readable. 

The ASSIST TRANTO statement contains three 
expressions: a condition, a destination state transla- 
tion, and a rate. The failure modes-effects simula- 
tion describes the destination state translation as a 
chain reaction among the components of the system 
using the concepts of component modes and mode 
effect messages developed in RMG. The FMES func- 
tions are grouped into two categories: effect functions 
(which are referenced in ASSIST TRANTO state 
translation expressions) and dependency functions. 
An effect function links the FMES with the ASSIST 
model. The dependency functions propagate the ef- 
fect throughout the system while making the appro- 
priate state changes. Figure 6 shows the FMES for 
the model of figure 5(b). 
According to the first TRANTO in figure 5(b), 
if a CH is GOOD, then it can fail with effect deter- 
mined by CH_FailEff(). The function CH_FailEff first 
modifies the channel’s state to FAILED + ERROR, then 
it calls dependency function FTP_Dependson_CH(). 
(See fig. 6.) The FTP dependency function uses the 
voter majority rule to determine the state of the FTP. 
FTP should recover if any channel is producing er- 
rors, and FTP is failed if the error-producing chan- 
nels outnumber the good channels. Setting FTP to a 
recovering state enables the second transition in fig- 
ure 5(b); this transition uses FTP_RecEff() to obtain 
the effect of the FTP recovery. In FTP_RecEff(), 
error-producing channels are set to not in use and 
eliminated from the system. 

/* Fail effect function for channel my_id */ 
CH_FailEff(my_id) 
int my_id; 
{ 
    SetFailError(CH[my_id]); 
    FTP_DEPENDSON_CH(); 
} 

/* Super component dependency function: applies the voter majority rule */ 
FTP_DEPENDSON_CH() 
{ 
    int i,g,b; 

    g=0; b=0; 
    for (i=1; i<=4; i++) 
    { 
        if (GoodInuse(CH[i]) && !Error(CH[i])) {g++;} 
        else if (ErrorInuse(CH[i])) {b++;} 
    } 
    FTP = GOOD | IN_USE; 
    if (b!=0) { SetRecover(FTP); } 
    if ((b>=g) || (g==0)) { SetFailError(FTP); } 
} 

/* Recovery effect function: removes error-producing channels */ 
FTP_RecEff() 
{ 
    int i; 

    for (i=1; i<=4; i++) 
        if (ErrorInuse(CH[i])) 
        { 
            SetNotInuse(CH[i]); 
            SetElim(CH[i]); 
        } 
    FTP_DEPENDSON_CH(); 
} 

Figure 6. FMES C code for figure 5(b). 

Modelling With FMES 

In the simple quad system, two types of transi- 
tions are modelled: failure transitions and recovery 
transitions. (However, others are possible.) Failure 
transitions can occur at any time to any component. 
The effect of the failure on the system state is deter- 
mined by that component’s fail effect function. Re- 
covery transitions are most often enabled by a com- 
ponent’s fail effect function (although they can be 
triggered by other effects). A recovery is brought 
about by a super component. A super component 
is a set of components that have been grouped to- 
gether to increase reliability. The quad fault-tolerant 
computer is an example of a super component. 

Super components are responsible for redundancy 
management. When a component fails, its fail effect 
function sets the RECOVER mode descriptor of that 
component’s super component. A good example is 
the quad redundant fault-tolerant computer. This 
super component is called FTP and is composed of 
four channels. When a channel fails, its fail effect 
function sets the RECOVER flag in FTP. The FTP 
recovery effect functions are then called during the 
calculation of the now enabled recovery transition. 

Super components do not have failure transitions, 
yet they are able to fail. Again, with the FTP as 
an example, majority voting is used among its set 
of channels to mask and detect errors. Whether 
or not the FTP super component is operating 
properly is a function of the state of the set of 
CH’s assigned to the FTP as calculated in function 
FTP_DEPENDSON_CH(). A function sensing the 
state of FTP is constructed and called as a death 
condition. If the death condition is met, the FTP has 
failed and thus the system (in this case) has failed. 

A brief description of how the FMES is used in 
ASSURE is as follows: 

1. A set of mode descriptors and effect messages is 
defined and a state vector constructed. 

2. Fail effect and recovery effect functions are de- 
fined. 

3. Condition functions for failure and recovery tran- 
sitions are defined. 

4. Death condition functions are defined. 

5. The model is executed for each component as 
follows: 

a. IF FAIL_CONDITION() TRANTO 
FAIL_EFFECT() BY RATE. 

b. IF RECOVERY_CONDITION() TRANTO 
RECOVERY_EFFECT() BY RATE. 

c. Test for DEATH_CONDITION() for each new 
state. 

d. Compute reliability as model is expanded, 
pruning where possible. 

6. Print results. 
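
Steps 5a through 5c can be sketched for the simple quad system as a depth-first expansion of the state space. This is an illustrative sketch only: the State and Counts types and the expand_quad driver are hypothetical names, channels are indexed from 0, and the rate arithmetic and pruning of step 5d (the SURE bounds) are left out, so the code counts transitions and death states rather than computing reliability.

```c
enum { GOOD = 1, IN_USE = 4, ERROR = 8, RECOVERING = 16, ELIMINATED = 32 };
#define NCH 4

typedef struct { int ch[NCH]; int ftp; } State;          /* hypothetical state vector */
typedef struct { long transitions; long deaths; } Counts; /* hypothetical tallies */

static int has(int v, int bits) { return (v & bits) == bits; }

/* Voter rule of figure 6: recompute the FTP mode from its channels. */
static void ftp_dependson_ch(State *s)
{
    int i, g = 0, b = 0;
    for (i = 0; i < NCH; i++) {
        if (has(s->ch[i], GOOD | IN_USE) && !(s->ch[i] & ERROR)) g++;
        else if (has(s->ch[i], ERROR | IN_USE)) b++;
    }
    s->ftp = GOOD | IN_USE;
    if (b != 0) s->ftp |= RECOVERING;
    if (b >= g || g == 0) s->ftp = (s->ftp | ERROR) & ~GOOD;
}

/* Fail effect: set ERROR, clear GOOD, then let the FTP re-vote. */
static void ch_fail_eff(State *s, int id)
{
    s->ch[id] = (s->ch[id] | ERROR) & ~GOOD;
    ftp_dependson_ch(s);
}

/* Recovery effect: eliminate every error-producing channel. */
static void ftp_rec_eff(State *s)
{
    int i;
    for (i = 0; i < NCH; i++)
        if (has(s->ch[i], ERROR | IN_USE))
            s->ch[i] = (s->ch[i] & ~IN_USE) | ELIMINATED;
    ftp_dependson_ch(s);
}

/* Death condition: the FTP super component is no longer GOOD. */
static int death_condition(const State *s) { return !(s->ftp & GOOD); }

static void expand(State s, Counts *c)
{
    int i;
    State next;
    if (death_condition(&s)) { c->deaths++; return; }   /* step 5c */
    for (i = 0; i < NCH; i++) {                         /* step 5a */
        if (s.ch[i] & GOOD) {
            next = s;
            ch_fail_eff(&next, i);
            c->transitions++;
            expand(next, c);
        }
    }
    if (s.ftp & RECOVERING) {                           /* step 5b */
        next = s;
        ftp_rec_eff(&next);
        c->transitions++;
        expand(next, c);
    }
}

/* Expand the quad model from the all-GOOD, all-IN_USE start state. */
Counts expand_quad(void)
{
    State s;
    Counts c = {0, 0};
    int i;
    for (i = 0; i < NCH; i++) s.ch[i] = GOOD | IN_USE;
    s.ftp = GOOD | IN_USE;
    expand(s, &c);
    return c;
}
```

Calling expand_quad() walks every failure and recovery sequence without aggregating equivalent states, which is why the transition count grows so quickly as components are added.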
Example 3: Mesh Network 

Figure 7 illustrates the system configuration for 
this mesh network example. Two network parti- 
tions consisting of 7 nodes and 14 links interface 
FTP with 4 quad redundant I/O groups. The mesh 
network uses a regrow algorithm to repair failures. 
The I/O devices connect to the network through 
device interface units. This system contains over 
80 components and 7 different redundancy man- 
agement groups or super components (FTP, NET1, 
NET2, and four I/O devices). The computer, two 
networks, and four I/O devices are reconfigurable 
and controlled by the seven separate redundancy 
management routines. Recovery from a simul- 
taneous failure of a channel of the fault-tolerant 
computer and a node of one of the networks re- 
quires two separate operations. However, the re- 
covery from a simultaneous link and node failure 
on the same network can be accomplished in one 
operation. 

Component mode descriptors and mode messages. As previously mentioned, the FMES is derived from the 
automated FMEA as used by RMG; thus, a set of mode descriptors and mode messages must be defined. 
Although different mode descriptors and messages can be defined for each component, it is best to seek, if 
possible, a set of common descriptors and messages that can be used throughout the system. The system in 
figure 7 is used as an example because it is large and complex, and the set of descriptors and messages needed 
to define that system should suffice for most others. 

Figure 7. Example 3: Mesh network. (Legend: FTP channel, network interface, link, node, device interface, 
and devices A through D.) 

Mode descriptors are implemented as bit values that have a meaning associated with their TRUE (set) and 
FALSE (reset) values. In the following descriptions, the first value is associated with TRUE and the value in 
parentheses is associated with FALSE. 


GOOD (FAILED):          Describes the component’s physical state. If the component is GOOD, it 
                        can FAIL at any time. 

ACTIVE (PASSIVE):       Describes the nature of failure. An ACTIVE failure is able to produce 
                        erroneous behavior. A PASSIVE failure is analogous to failing safe. 

ERROR (BENIGN):         States that detectable errors are being produced. 

IN_USE (NOT_IN_USE):    Used primarily for modelling spares. For example, a component (such as 
                        a link) that is NOT_IN_USE might not affect the system with an active 
                        failure. Super component recovery effect functions control the value of this 
                        descriptor. 

RECOVERING (NORMAL):    Used with super components to enable recovery transitions. 

ELIMINATED (MEMBER):    Used in recovery effect functions to mark a component as having been 
                        removed from the set of good components. 

A component typically begins in a GOOD + IN_USE state. A failure can cause it to transition to FAILED 
+ IN_USE + ERROR and can cause its super component to transition from GOOD + IN_USE + NORMAL 
to GOOD + IN_USE + RECOVERING. 
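
To relate this notation to the stored bit values, a small decoding helper can be useful when debugging. The sketch below is not part of the FMES code; describe_mode is a hypothetical name, and the choice to omit the ACTIVE/PASSIVE descriptor and the NORMAL and MEMBER defaults from the output is an assumption made to match the notation used in the text.

```c
#include <stdio.h>
#include <string.h>

enum { GOOD = 1, ACTIVE = 2, IN_USE = 4, ERROR = 8,
       RECOVERING = 16, ELIMINATED = 32 };

/* Writes a mode such as "FAILED + IN_USE + ERROR" into buf and returns buf.
   The FALSE names (FAILED, NOT_IN_USE) are taken from the descriptor list;
   NORMAL and MEMBER are simply left out when their bits are clear. */
char *describe_mode(int mode, char *buf, size_t n)
{
    snprintf(buf, n, "%s + %s%s%s%s",
             (mode & GOOD)       ? "GOOD"          : "FAILED",
             (mode & IN_USE)     ? "IN_USE"        : "NOT_IN_USE",
             (mode & ERROR)      ? " + ERROR"      : "",
             (mode & RECOVERING) ? " + RECOVERING" : "",
             (mode & ELIMINATED) ? " + ELIMINATED" : "");
    return buf;
}
```

For instance, the value 5 decodes to GOOD + IN_USE, 12 to FAILED + IN_USE + ERROR, and 21 to GOOD + IN_USE + RECOVERING.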
Mode messages have mutually exclusive values. Because the mode messages are not part of the state 
variable, they can have similar names to convey similar meaning. 

NOMINAL: Indicates that the sending component’s current operation is within specification. 

FAIL: Indicates that the sending component has failed. 

ERROR: Indicates detectable erroneous behavior. 

NONE: Indicates passive failure. 

NIU (Not In Use): Indicates that the sending component has been switched to standby (as a spare). 

Fail effect functions. A fail effect function is named by attaching the term “_FailEff” to a component’s 
state variable name. For example, component CH has fail effect function CH_FailEff. A fail effect function 
has three stages. The first stage alters the component’s mode, which can be, for example, from GOOD + 
IN_USE to FAILED + IN_USE + ERROR. A second function can then be called to send the appropriate effect 
messages to this component’s neighbors. This function is named by attaching “_Dependents” to the component 
name (e.g., CH_Dependents). Finally, a component calls zero, one, or more super component dependency 
functions. The super component dependency functions can be contained in the “_Dependents” function, but 
it is best to separate them because the super components are different from normal components. 

Dependency functions. A primary dependency function interprets a component’s mode and sends a 
message reflecting the component’s new state to those other components that are immediately affected. The 
messages are sent through use of secondary dependency function calls of the form “X_Dependson_Y(X_id, Y_id, 
Y_message),” where Y is the local component. Thus, the function call X_Dependson_Y(X_id, Y_id, Y_message) 
is found in function Y_Dependents. 

For example, consider the NI, which resides in a channel of the FTP (CH). As a result of executing a failure 
transition for component CH[2], fail effect function CH_FailEff(2) is called. This function then calls the primary 
dependency function CH_Dependents(2). Because two NI’s reside in CH[2], two secondary dependency function 
calls are made as follows: 

NI_Dependson_CH(2,2,FAIL); 

NI_Dependson_CH(3,2,FAIL); 

The secondary dependency function alters the receiving component’s state and then calls that component’s 
primary dependency function. The effect of the failure is thus propagated throughout the system. 
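
The call chain just described can be sketched as follows. This is a simplified illustration rather than the FMES code itself: the residence table ni_host is hypothetical (only the pairing of NI[2] and NI[3] with CH[2] comes from the text), the message names carry an MSG_ prefix because C does not allow an enumerator to share a name with the ERROR descriptor, the calls counter and init_modes helper exist only for inspection, and NI_Dependents is left as a stub.

```c
enum { GOOD = 1, ACTIVE = 2, IN_USE = 4, ERROR = 8 };
typedef enum { MSG_NOMINAL, MSG_FAIL, MSG_ERROR, MSG_NONE, MSG_NIU } Message;

#define NCH 4
#define NNI 4

static int CH[NCH + 1];   /* channel modes, indexed from 1 as in the text */
static int NI[NNI + 1];   /* network interface modes */

/* Hypothetical residence table: NI[k] resides in channel ni_host[k]. */
static const int ni_host[NNI + 1] = {0, 1, 2, 2, 3};

static int calls;         /* counts secondary dependency calls (for inspection) */

/* Primary dependency function of an NI; propagation beyond the NI is omitted. */
void NI_Dependents(int ni_id) { (void)ni_id; }

/* Secondary dependency function: alter the receiver, then propagate further. */
void NI_Dependson_CH(int ni_id, int ch_id, Message msg)
{
    (void)ch_id;
    calls++;
    if (msg == MSG_FAIL)
        NI[ni_id] = (NI[ni_id] | ERROR) & ~GOOD;   /* FAILED + IN_USE + ERROR */
    NI_Dependents(ni_id);
}

/* Primary dependency function: interpret CH mode, notify each resident NI. */
void CH_Dependents(int ch_id)
{
    int k;
    Message msg = (CH[ch_id] & GOOD) ? MSG_NOMINAL : MSG_FAIL;
    for (k = 1; k <= NNI; k++)
        if (ni_host[k] == ch_id)
            NI_Dependson_CH(k, ch_id, msg);
}

/* Fail effect function: stage 1 changes the mode, stage 2 notifies neighbors.
   Stage 3 (the super component dependency call) is omitted in this sketch. */
void CH_FailEff(int ch_id)
{
    CH[ch_id] = (CH[ch_id] | ERROR) & ~GOOD;
    CH_Dependents(ch_id);
}

/* Hypothetical helper: put every component in GOOD + IN_USE. */
void init_modes(void)
{
    int k;
    for (k = 1; k <= NCH; k++) CH[k] = GOOD | IN_USE;
    for (k = 1; k <= NNI; k++) NI[k] = GOOD | IN_USE;
    calls = 0;
}
```

After init_modes(), a call to CH_FailEff(2) produces exactly two secondary calls and leaves NI[2] and NI[3] failed while NI[1] is untouched.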

A super component dependency function (e.g., FTP_Dependson_CH) differs substantially from a normal 
component’s dependency function. This difference occurs because a super component must have access to the 
state of all components in its domain. For example, FTP_Dependson_CH must be able to read the state of each 
of its channels to determine whether the voter function is error free. Also, in the case of the network, a single 
failed node has the effect of taking the network off-line until the network repairs. Thus, the super component 
function NET_Dependson_NODE must be able to alter the state of all nodes and links in the network to set 
them to NOT_IN_USE. 
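
A minimal sketch of such a network super component function is given below. The Network structure, the net_init helper, the pointer-based signature, and the assumption that each network holds 7 nodes and 14 links are all illustrative choices, not details taken from the FMES code.

```c
enum { GOOD = 1, IN_USE = 4, ERROR = 8, RECOVERING = 16 };

#define NNODE 7    /* assumed number of nodes in one network */
#define NLINK 14   /* assumed number of links in one network */

typedef struct {
    int net;            /* mode of the network super component itself */
    int node[NNODE];    /* modes of the nodes in its domain */
    int link[NLINK];    /* modes of the links in its domain */
} Network;

/* Hypothetical helper: start every component in GOOD + IN_USE. */
void net_init(Network *n)
{
    int i;
    n->net = GOOD | IN_USE;
    for (i = 0; i < NNODE; i++) n->node[i] = GOOD | IN_USE;
    for (i = 0; i < NLINK; i++) n->link[i] = GOOD | IN_USE;
}

/* If any in-use node is producing errors, request a recovery and take the
   whole network off-line by clearing IN_USE on every node and link. */
void NET_Dependson_NODE(Network *n)
{
    int i, errors = 0;
    for (i = 0; i < NNODE; i++)
        if ((n->node[i] & (ERROR | IN_USE)) == (ERROR | IN_USE))
            errors++;
    if (errors == 0)
        return;                       /* nothing to do: network stays on-line */
    n->net |= RECOVERING;
    for (i = 0; i < NNODE; i++) n->node[i] &= ~IN_USE;
    for (i = 0; i < NLINK; i++) n->link[i] &= ~IN_USE;
}
```

The point of the sketch is the access pattern: unlike a normal dependency function, this one reads and writes the modes of every component in its domain.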


Recovery effect functions. A recovery effect function is named by attaching the term “_RecEff” to the 
super component’s name (e.g., FTP_RecEff). A recovery effect function examines and alters, if necessary, the 
state of each of the components in its domain. For example, after failing, a CH is in mode FAILED + IN_USE 
+ ERROR. The recovery function changes this to FAILED + NOT_IN_USE + ERROR + ELIMINATED; 
the device is now no longer in use or part of the spare pool. The recovery function then calls the component’s 
primary dependency function to propagate the effect of the mode changes. 

Effect of CH[1] failure. The complete FMES for this system is not given here because of the amount of 
detail. The following description explains what happens when CH[1] fails: 
CH_FailEff():             CH[1] set to FAILED + IN_USE + ERROR status. CH_Dependents() is 
                          called to propagate state change. Super component FTP_Dependson_CH() 
                          is called. 

CH_Dependents():          NI_Dependson_CH() is called with FAIL message. 

FTP_Dependson_CH():       Voter status checked. FTP set to RECOVERING because of error on 
                          CH[1]. 

NI_Dependson_CH():        NI[1] set to FAILED + IN_USE + ERROR status (effect of FAIL message 
                          from CH). NET1_Dependson_NI() is called with ERROR message. 

NET1_Dependson_NI():      NI[1] is controller (IN_USE) and sends ERROR, so NET1 sets 
                          RECOVERING. NET1 also sets all children (NODES and LINKS) to 
                          GOOD + NOT_IN_USE. NODE_Dependents() and LINK_Dependents() 
                          functions are called with NOT_IN_USE message. (LINK_Dependents are 
                          not traced from this point.) 

NODE_Dependents():        For each NODE, message from parent NET is interrogated and corre- 
                          sponding effect message is sent to the node's attached DIU (if one exists). 
                          In this case, four nodes send a NOT_IN_USE message to their DIU's. 

DIU_Dependson_NODE():     In response to the NODE message, the DIU sets its mode to NOT_IN_USE. 
                          Device component function, DEVn_Dependson_DIU(), is called with 
                          NOT_IN_USE message. 

DEVn_Dependson_DIU():     In response to the NOT_IN_USE message sent from the DIU, the I/O 
                          device sets its mode to NOT_IN_USE also. Because the I/O devices are 
                          quad redundant, super component DEVICE_Dependson_DEVn() is called. 

DEVICE_Dependson_DEVn():  Voter status checked. DEVICE is not set to RECOVER because device 
                          error is not present (being NOT_IN_USE is not an error condition). 


Upon return to ASSIST, the following state exists: 

CH[1] is set to ERROR. NI[1] is set to ERROR. 

FTP is set to RECOVERING. NET1 is set to 
RECOVERING. 

All NODES, LINKS, DIU’s, and DEVICES on 
NET1 are set to NOT_IN_USE. 

A component such as the DEVn can be both 
GOOD and NOT_IN_USE and still fail to a state of 
FAILED, ERROR, and NOT_IN_USE. If this fail- 
ure occurs, then upon restoration of the network, 
when the DEVn status is changed from NOT_IN_USE 
to IN_USE, the super component function 
DEVICE_Dependson_DEVn detects the error and 
sets a recovery for the DEVICE. 

Results. Because the FMES is an extension of 
ASSIST, results are only available for serial ASSURE 
and parallel ASSURE. When run on a SUN 3/150 
processor, serial ASSURE took 6.2 hours and pro- 
duced over 27 million transitions. If states are not 
aggregated, then the number of transitions is equiv- 
alent to the number of states. Many of the states are 
equivalent, and direct comparison with other tools 
that would aggregate these states is difficult. How- 
ever, analyzing this system of dynamically reconfiguring 
mesh networks in reasonable time on an ordinary 
computer is an accomplishment that has not 
been achieved before. Parallel ASSURE (again using 
a 32-node hypercube) solved this model in a scant 
1.3 minutes. It is expected that large fault-tolerant 
systems, typical of those found in today’s avionic ar- 
chitectures, can now be analyzed using FMES and 
Parallel ASSURE. 
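
As a rough check on these figures, the ratio of the two run times quoted above can be worked out directly. The serial and parallel runs were made on different machines, so this ratio is a wall-clock comparison and should not be read as a parallel efficiency.

```latex
\[
\frac{t_{\text{serial}}}{t_{\text{parallel}}}
  = \frac{6.2\ \text{h} \times 60\ \text{min/h}}{1.3\ \text{min}}
  = \frac{372\ \text{min}}{1.3\ \text{min}}
  \approx 286
\]
```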

Concluding Remarks 

The reliability model generator (RMG) and 
ASSURE are prototype programs that have been 
developed to test advanced concepts for the reli- 
ability analysis of future fault-tolerant flight con- 
trol systems. Results of tests using RMG indicate 
that the automated failure modes-effects analysis 
(FMEA) algorithm embedded in RMG successfully 
generated an accurate reliability model from a graph- 
ical block diagram of the system. A drawback of 
this technique may be that the model size is almost 
three times larger for the automatically generated 
graphical model. 

Combining the processes of the ASSIST and 
SURE programs, the ASSURE program eliminates 
the need to produce and maintain the complete 
model state space in memory. Solving a model 
with over 40 000 states and 1 000 000 transitions, the 
ASSURE program execution was 10 times faster than 
ASSIST/SURE and used 100 times less memory. An- 
other feature of ASSURE is that its solution tech- 
nique can be parallelized and thus can be executed 
on parallel computers such as the hypercube. When 
this same model was run on a 32-node hypercube, an- 
other hundredfold increase in performance over serial 
ASSURE was obtained. 

To better model complex redundancy manage- 
ment processes, the ASSIST language syntax was ex- 
tended in ASSURE to allow function calls to C lan- 
guage procedures. Drawing on the automated FMEA 
approach pioneered with RMG, a modelling tech- 
nique called failure modes-effects simulation was used 
to model a large system consisting of one quad fault- 
tolerant computer, two mesh networks, and several 
quad redundant input/output devices. The system 
contained over 80 components and 7 redundancy 
management groups overall. This system produced 
over 27 million transitions and took 6.5 hours to 
complete using the serial version of ASSURE. The 
parallel version was completed in 1.3 minutes. 

These results indicate that the techniques are 
available to represent and solve large, complex re- 
liability models of integrated and distributed flight 
control systems. 


NASA Langley Research Center 
Hampton, VA 23681-0001 
July 23, 1992 